INTRODUCTION

The COVID-19 pandemic has reached almost every country during 2020. As an effect of extreme social-distancing measures adopted by governments to fight an increasing sanitary crisis, an unprecedent shutdown of demand and supply chains drove world economies to recession.

In order to quantify impact COVID-19, this study will use IMF estimated figures for GDP real growth for over 180 countries in 2020. It is known that this indicator is normally affected by a multitude of national and international factors, but the magnitude of the recession caused by the pandemic, as well as its global reach, makes it an adequate response variable to represent impact.


The data to be explored includes 20 factors extracted from ourworldindata.com, varying from general demographic figures to more COVID-related statistics. A database with 8 additional features from an IMF research is also used to include information on fiscal and monetary stimulus announced by several countries in response to COVID.

# Check all features to be used as predictors
Features[-c(1,2)]
##  [1] "total_cases_per_million"            
##  [2] "total_deaths_per_million"           
##  [3] "icu_patients_per_million"           
##  [4] "hosp_patients_per_million"          
##  [5] "total_tests_per_thousand"           
##  [6] "positive_rate"                      
##  [7] "total_vaccinations_per_hundred"     
##  [8] "people_fully_vaccinated_per_hundred"
##  [9] "stringency_index"                   
## [10] "population_density"                 
## [11] "median_age"                         
## [12] "gdp_per_capita"                     
## [13] "extreme_poverty"                    
## [14] "cardiovasc_death_rate"              
## [15] "diabetes_prevalence"                
## [16] "female_smokers"                     
## [17] "male_smokers"                       
## [18] "hospital_beds_per_thousand"         
## [19] "life_expectancy"                    
## [20] "human_development_index"            
## [21] "Fiscal.additional.subtotal"         
## [22] "Fiscal.health"                      
## [23] "Fiscal.non.health"                  
## [24] "Fiscal.accelerated.or.deferredrev"  
## [25] "Liquidity.subtotal"                 
## [26] "Liquidity.below.the.line"           
## [27] "Liquidity.guarantees"               
## [28] "Liquidity.quasi.fiscal"

After proper cleaning and structuring, different statistical methods are applied to data and results are analyzed. For practical purposes, this paper presents results of each step as it describes methodology and justifies the use of a particular approach.

Immunization programs, despite crucial for understanding the most recent scenario, only started in the second half of December and thus its data cannot be used to model GDP real growth in 2020. For this reason, vaccination figures are explored separately and modelled with the use of a different statistical method.

DATA CLEANING AND STRUCTURING

The major challenge of dealing with the available data is the large number of missing data, for both country figures (rows) and features (columns). Although some advanced machine learning algorithms can fit a regression model with missing data, linear correlations cannot be analyzed when that is the case.

For this reason, the database is modified through the following steps:

Notes:

# Check number of countries left in data
nrow(impTrain)
## [1] 127
# Check number of features left in data
ncol(impTrain)
## [1] 20
# Overview of train data after cleaning and imputation
Data[index,1] %>% cbind(impTrain) %>% head(5)
##            . total_cases_per_million total_deaths_per_million
## 159 Zimbabwe                 932.993                 24.42300
## 14    Belize               27101.184                623.71000
## 50  Eswatini                8066.101                176.69900
## 118 Portugal               40569.764                677.27700
## 43  Dominica                1222.375                 76.25451
##     positive_rate stringency_index population_density median_age
## 159       0.09500         74.96856             42.729     19.600
## 14        0.23897         72.13777             16.426     25.000
## 50        0.11502         69.60810             79.492     21.500
## 118       0.12600         68.05097            112.371     46.200
## 43        0.08183         43.46094             98.567     30.555
##     gdp_per_capita extreme_poverty cardiovasc_death_rate
## 159       1899.775          21.400               307.846
## 14        7824.362           5.636               176.957
## 50        7738.975          16.473               333.436
## 118      27936.896           0.500               127.842
## 43        9673.367           5.541               227.376
##     diabetes_prevalence female_smokers male_smokers
## 159                1.82          1.600       30.700
## 14                17.11          7.221       25.251
## 50                 3.94          1.700       16.500
## 118                9.85         16.300       30.000
## 43                11.62          2.830       31.114
##     hospital_beds_per_thousand life_expectancy human_development_index
## 159                       1.70           61.49                   0.571
## 14                        1.30           74.62                   0.716
## 50                        2.10           60.19                   0.611
## 118                       3.39           82.05                   0.864
## 43                        3.80           75.00                   0.742
##     Fiscal.additional.subtotal Fiscal.health Fiscal.non.health
## 159                  4.8015925     0.1400995          4.661493
## 14                   0.7497281     0.5086054          1.329882
## 50                   2.8361923     0.3712294          2.464963
## 118                  4.7257658     1.0943879          3.631378
## 43                   1.9141434     0.0000000          1.914143
##     Liquidity.subtotal     y
## 159           1.435653  -8.0
## 14            3.298804 -14.1
## 50            1.627220  -3.3
## 118           6.466837  -7.6
## 43            0.708942 -10.4

FEATURE ANALYSIS AND MODELLING

Feature anaysis and modelling will filter out predictors with weak impact on response variable (GDP real growth).

Correlation

A table is created with Pearson correlation values between features and response variable, along with p-values of a correlation test. It is important to mention that these results do not imply causation, but simply a relationship or pattern between the values.

Results:

# Check correlation to response variable
Correlation
##                       Feature  Correlation        Pvalue
## 2         diabetes_prevalence -0.352922233 0.00004708259
## 3             extreme_poverty  0.346623848 0.00006548532
## 10    human_development_index -0.306317930 0.00046094598
## 19   total_deaths_per_million -0.289934166 0.00094501749
## 11            life_expectancy -0.274039606 0.00182303715
## 14                 median_age -0.273703098 0.00184781600
## 12         Liquidity.subtotal -0.259828791 0.00317762548
## 7           Fiscal.non.health -0.214229114 0.01558389940
## 17           stringency_index -0.194875963 0.02812626336
## 5  Fiscal.additional.subtotal -0.194398704 0.02852142883
## 18    total_cases_per_million -0.188435857 0.03387026739
## 9  hospital_beds_per_thousand -0.139501030 0.11776788190
## 4              female_smokers -0.139190644 0.11859527173
## 1       cardiovasc_death_rate  0.122074641 0.17155369905
## 15         population_density -0.079802210 0.37246973129
## 16              positive_rate -0.078655798 0.37939886436
## 8              gdp_per_capita -0.069258868 0.43909618951
## 13               male_smokers -0.057771625 0.51882668429
## 6               Fiscal.health  0.003751348 0.96661214473
## 20                          y  1.000000000            NA

Note that 11 out of 19 predictors have a p-value lower than 0.05 in a correlation test, thus presenting sufficient evidence to conclude that there is a significant linear relationship (with alpha = 0.05).

Keep in mind these correlation coefficients are calculated with imputed data, which can affect results. Another limitation of this analysis is that it does not account for non-linear relationships.

Xgboost

The Xgboost is a gradient-boosting algorithm often used in machine learning solutions. The advantages of applying this method in variable selection include the flexibility of modelling with missing data, as well as the fact that it is able to handle co-dependence between predictors - while linear models are not.

A xgboost model is fitted on train data, and parameters are adjusted in order to minimize error observed after prediction in test data. The error metric used in this analysis is the root mean squared error (RMSE).

# Check RMSE of test data
sqrt(mean((Test$y-Pred)^2))
## [1] 5.932597

Compare predictor importance

Correlation results (Correlation and Pvalue) are complemented with xgboost outputs for variable importance in a single table. The column NAs indicates the percentage of missing records in training data, while VarImp and ImpAboveMean are related to the improvement in accuracy brought by each feature in the xboost model.

Results:

# Compare variable importance
FeaturesImp
##                       Feature Correlation Pvalue  NAs VarImp ImpAboveMean
## 2         diabetes_prevalence       -0.35   0.00 0.00   0.10         TRUE
## 3             extreme_poverty        0.35   0.00 0.28   0.05        FALSE
## 10    human_development_index       -0.31   0.00 0.00   0.24         TRUE
## 19   total_deaths_per_million       -0.29   0.00 0.06   0.09         TRUE
## 11            life_expectancy       -0.27   0.00 0.00   0.00        FALSE
## 14                 median_age       -0.27   0.00 0.01   0.02        FALSE
## 12         Liquidity.subtotal       -0.26   0.00 0.36   0.00        FALSE
## 7           Fiscal.non.health       -0.21   0.02 0.11   0.04        FALSE
## 17           stringency_index       -0.19   0.03 0.00   0.01        FALSE
## 5  Fiscal.additional.subtotal       -0.19   0.03 0.04   0.01        FALSE
## 18    total_cases_per_million       -0.19   0.03 0.01   0.06         TRUE
## 9  hospital_beds_per_thousand       -0.14   0.12 0.09   0.02        FALSE
## 4              female_smokers       -0.14   0.12 0.16   0.08         TRUE
## 1       cardiovasc_death_rate        0.12   0.17 0.01   0.02        FALSE
## 15         population_density       -0.08   0.37 0.01   0.02        FALSE
## 16              positive_rate       -0.08   0.38 0.36   0.04        FALSE
## 8              gdp_per_capita       -0.07   0.44 0.00   0.09         TRUE
## 13               male_smokers       -0.06   0.52 0.17   0.06         TRUE
## 6               Fiscal.health        0.00   0.97 0.11   0.05        FALSE

Among all 11 predicts with a p-value lower than 0.05 in the correlation test, only 4 are above average in variable importance to the xgboost model. Note that stringency index, which is a metric directly related to pandemic response by governments, has no missing values and a p-value of 0.03. Its VarImp, however, is one of the lowest in record.

The next step is to eliminate features with weak results. Only variables with a p-value lower than alpha (0.05) or with above-average VarImp are selected:

# Check features left in data after selection
newTrain %>% names()
##  [1] "total_cases_per_million"    "total_deaths_per_million"  
##  [3] "stringency_index"           "median_age"                
##  [5] "gdp_per_capita"             "extreme_poverty"           
##  [7] "diabetes_prevalence"        "female_smokers"            
##  [9] "male_smokers"               "life_expectancy"           
## [11] "human_development_index"    "Fiscal.additional.subtotal"
## [13] "Fiscal.non.health"          "Liquidity.subtotal"        
## [15] "y"

Correlation among features and data transformations

If a linear model is to be considered, the issue of collinearity among predictors must be addressed. A correlation table summarizes these relationships found in data, indicating correlation coefficient and the p-value status for each case (p < .0001 ‘* * * *’; p < .001 ‘* * *’, p < .01 ‘* *’, p < .05 ‘*’).

Results:

# Show feature correlation table
FeatCorrelation
##                            total_cases_per_million
## total_cases_per_million                           
## total_deaths_per_million                  0.81****
## stringency_index                          0.28**  
## median_age                                0.64****
## gdp_per_capita                            0.61****
## extreme_poverty                          -0.47****
## diabetes_prevalence                       0.02    
## female_smokers                            0.64****
## male_smokers                              0.05    
## life_expectancy                           0.58****
## human_development_index                   0.64****
## Fiscal.additional.subtotal                0.24**  
## Fiscal.non.health                         0.24**  
## Liquidity.subtotal                        0.25**  
##                            total_deaths_per_million stringency_index
## total_cases_per_million                                             
## total_deaths_per_million                                            
## stringency_index                           0.30***                  
## median_age                                 0.59****         0.15    
## gdp_per_capita                             0.38****         0.13    
## extreme_poverty                           -0.43****        -0.29*** 
## diabetes_prevalence                       -0.07             0.05    
## female_smokers                             0.68****         0.06    
## male_smokers                               0.02            -0.03    
## life_expectancy                            0.52****         0.19*   
## human_development_index                    0.56****         0.23**  
## Fiscal.additional.subtotal                 0.30***         -0.01    
## Fiscal.non.health                          0.28**           0.01    
## Liquidity.subtotal                         0.33***         -0.04    
##                            median_age gdp_per_capita extreme_poverty
## total_cases_per_million                                             
## total_deaths_per_million                                            
## stringency_index                                                    
## median_age                                                          
## gdp_per_capita               0.67****                               
## extreme_poverty             -0.75****      -0.53****                
## diabetes_prevalence          0.11           0.20*          -0.32*** 
## female_smokers               0.70****       0.47****       -0.38****
## male_smokers                 0.14          -0.10           -0.18*   
## life_expectancy              0.87****       0.71****       -0.78****
## human_development_index      0.92****       0.78****       -0.81****
## Fiscal.additional.subtotal   0.49****       0.35****       -0.30*** 
## Fiscal.non.health            0.53****       0.40****       -0.35****
## Liquidity.subtotal           0.49****       0.31***        -0.28**  
##                            diabetes_prevalence female_smokers male_smokers
## total_cases_per_million                                                   
## total_deaths_per_million                                                  
## stringency_index                                                          
## median_age                                                                
## gdp_per_capita                                                            
## extreme_poverty                                                           
## diabetes_prevalence                                                       
## female_smokers                       -0.14                                
## male_smokers                          0.14           0.14                 
## life_expectancy                       0.25**         0.55****     0.07    
## human_development_index               0.20*          0.62****     0.08    
## Fiscal.additional.subtotal           -0.06           0.43****     0.09    
## Fiscal.non.health                    -0.02           0.45****     0.04    
## Liquidity.subtotal                    0.07           0.33***      0.01    
##                            life_expectancy human_development_index
## total_cases_per_million                                           
## total_deaths_per_million                                          
## stringency_index                                                  
## median_age                                                        
## gdp_per_capita                                                    
## extreme_poverty                                                   
## diabetes_prevalence                                               
## female_smokers                                                    
## male_smokers                                                      
## life_expectancy                                                   
## human_development_index           0.93****                        
## Fiscal.additional.subtotal        0.45****                0.49****
## Fiscal.non.health                 0.49****                0.54****
## Liquidity.subtotal                0.42****                0.41****
##                            Fiscal.additional.subtotal Fiscal.non.health
## total_cases_per_million                                                
## total_deaths_per_million                                               
## stringency_index                                                       
## median_age                                                             
## gdp_per_capita                                                         
## extreme_poverty                                                        
## diabetes_prevalence                                                    
## female_smokers                                                         
## male_smokers                                                           
## life_expectancy                                                        
## human_development_index                                                
## Fiscal.additional.subtotal                                             
## Fiscal.non.health                            0.97****                  
## Liquidity.subtotal                           0.41****          0.42****

Principal component analysis is one of the appropriate methods of solving the collinearity among variables. It is applied to 4 groups of predictors and only the first component (PC1) of each group will be used in final database:

  • General demographics: median_age,extreme_poverty, life_expectancy, human_development_index and gdp_per_capita
  • COVID cases and deaths: total_cases_per_million and total_deaths_per_million
  • Stimulus: Liquidity.subtotal, Fiscal.additional.subtotal, Fiscal.non.health and Fiscal.health
  • Smokers: female_smokers and male_smokers

At this point data had a decrease in features from 28 to 6:

# Check features in final data
newTrain %>% names()
## [1] "stringency_index"    "diabetes_prevalence" "y"                  
## [4] "Demographics"        "Cases_and_deaths"    "Stimulus"           
## [7] "Smokers"

Linear model

The main goal of fitting a linear model is not to generate predictions, but to analyze its outputs:

  • R-squared, representing the proportion of variance accounted for the model
  • Beta coefficients, representing the size and direction of the variable effect on GDP real growth
  • Coefficient p-values, testing the null hypothesis that the coefficient is equal to zero (no effect on GDP real growth)

Results:

# Show linear model output
summary(LinearModel)
## 
## Call:
## lm(formula = y ~ ., data = newTrain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0074  -2.0151   0.0972   2.2746  10.6906 
## 
## Coefficients:
##                        Estimate  Std. Error t value   Pr(>|t|)    
## (Intercept)          1.85097841  1.66502280   1.112   0.268497    
## stringency_index    -0.05282832  0.02733710  -1.932   0.055657 .  
## diabetes_prevalence -0.41628165  0.08563977  -4.861 0.00000358 ***
## Demographics         0.00006253  0.00002251   2.778   0.006353 ** 
## Cases_and_deaths    -0.00005661  0.00002588  -2.187   0.030673 *  
## Stimulus             0.21890758  0.05775312   3.790   0.000237 ***
## Smokers              0.01389030  0.02813462   0.494   0.622415    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.823 on 120 degrees of freedom
## Multiple R-squared:  0.2803, Adjusted R-squared:  0.2443 
## F-statistic:  7.79 on 6 and 120 DF,  p-value: 0.0000004338

As indicated by the (adjusted) R-squared metric, this model is only able to explain 24.6% of the variance of GDP real growth of 2020. Despite the low value, fitting a linear model had the main goal of understanding the relationships between factors and response variable.

Smokers is the only variable for which its beta is not significant at an alpha of 0.1. It is eliminated from the database and another model is fitted:

# Show second linear model output
summary(LinearModel)
## 
## Call:
## lm(formula = y ~ ., data = newTrain)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0047  -2.0270  -0.0081   2.2734  10.8744 
## 
## Coefficients:
##                        Estimate  Std. Error t value   Pr(>|t|)    
## (Intercept)          1.85026227  1.65981081   1.115   0.267170    
## stringency_index    -0.05374555  0.02718853  -1.977   0.050342 .  
## diabetes_prevalence -0.40941825  0.08423943  -4.860 0.00000356 ***
## Demographics         0.00006028  0.00002198   2.743   0.007012 ** 
## Cases_and_deaths    -0.00005367  0.00002511  -2.137   0.034580 *  
## Stimulus             0.21564104  0.05719331   3.770   0.000254 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.811 on 121 degrees of freedom
## Multiple R-squared:  0.2788, Adjusted R-squared:  0.249 
## F-statistic: 9.357 on 5 and 121 DF,  p-value: 0.0000001473
Pred <- predict(LinearModel, newTest)
sqrt(mean((newTest$y-Pred)^2))
## [1] 5.403634

The second model has all coefficients with p-value lower than 0.1 and also shows minor improvement on adjusted R-squared.

As noticed in the coefficient table, all but 2 variables (Demographics and Stimulus) have opposite impact on GDP real growth, that is, response variable tend to increase if they decrease.

Note: RMSE for test data is slightly better than that of the Xgboost model (5.932597).

THE VACCINATION CASE

Vaccination could not be used in modelling since it has only started in the second half of December 2020. Despite of that, data of countries already advanced in the process of immunization is available and can be used to further analyze how it can impact other factors.

Stringency index, a statistically significant coefficient in our linear model (alpha = 0.1), is expected to drop as more people become immune to the virus. It is possible to validate this hypothesis by using VAR/VECM time series modelling, as a cointegration between stringency and vaccination figures could provide favorable evidence.

Israel, a country where oved 60% of its population is already vaccinated, will be our case for this analysis. From its data, 2 time-series are selected: total vaccination and stringency index, starting from December 19th 2020.

# Overview of time series
Data[,c(1,9)] %>% head(5)
##                     total_vaccinations stringency_index
## 2020-12-19 07:00:00                 61         69.71143
## 2020-12-20 07:00:00               7433         70.50571
## 2020-12-21 07:00:00              32304         71.30000
## 2020-12-22 07:00:00              76914         71.69571
## 2020-12-23 07:00:00             139728         72.09143

Stringency index is modified to better show an eventual relationship to total vaccinations: data is smoothed with a one week simple moving average and inverted.

# Compare time series
plot.ts(CompareTimeSeries)

Inverted stringency index apparently follows total vaccination as it increases.

Lag order selection

This relationship between vaccination and stringency index is time-lagged, as one time series must be shifted in time to present correlation to another.

The selection of an optimal lag length to be used in a VAR model estimation can be done in different ways. Plotting a cross correlation function (CCF) chart of the log-transformed time series is useful for a visual approach in determining best lag order:

# Select best lag
ccf(VaccinationTransformed,StringencyTransformed,lag.max = 10)

The blue dotted lines represent limits for a statistically significant correlation. In this case, lags from -7 to -10 periods in total vaccination show relationship to inverted stringency index.

Another method of selecting optimal lag is to calculate different metrics used as criteria. Final Prediction Error (FPE) Akaike (AIC), Hannah-Quinn (HQ) and Schwarz information criteria (SC) are commonly used methodologies for this procedure.

# Check best lag order
vModel
## $selection
## AIC(n)  HQ(n)  SC(n) FPE(n) 
##      8      8      8      8 
## 
## $criteria
##                1         2         3         4         5         6
## AIC(n)  3.414806  3.454960  3.375105  3.417890  3.444158  3.254624
## HQ(n)   3.479591  3.562935  3.526269  3.612243  3.681701  3.535357
## SC(n)   3.575078  3.722080  3.749072  3.898705  4.031821  3.949135
## FPE(n) 30.412296 31.663009 29.242512 30.538683 31.380281 25.996764
##                7         8         9        10
## AIC(n)  3.312524  2.587262  2.630491  2.670644
## HQ(n)   3.636446  2.954374  3.040792  3.124136
## SC(n)   4.113883  3.495469  3.645545  3.792547
## FPE(n) 27.596418 13.394025 14.028312 14.658435

The lag order that minimises each of the criteria indicates the optimal value for estimating a VAR model. Note that all of the 4 methods indicates 8 as optimal, which is used for fitting.

Johansen procedure

A cointegration test is used to establish if there is a correlation between vaccination and stringency index in the long term. The Johansen test (or procedure) is used to analyze cointegration relationship among these variables:

# Test for cointegration
summary(JTest)
## 
## ###################### 
## # Johansen-Procedure # 
## ###################### 
## 
## Test type: trace statistic , with linear trend 
## 
## Eigenvalues (lambda):
## [1] 0.14314925 0.04153513
## 
## Values of teststatistic and critical values of test:
## 
##           test 10pct  5pct  1pct
## r <= 1 |  4.20  6.50  8.18 11.65
## r = 0  | 19.49 15.66 17.95 23.52
## 
## Eigenvectors, normalised to first column:
## (These are the cointegration relations)
## 
##                           TotalVaccination.l8 InverseStringencyIndex.l8
## TotalVaccination.l8                 1.0000000                   1.00000
## InverseStringencyIndex.l8           0.1477296                  -2.62168
## 
## Weights W:
## (This is the loading matrix)
## 
##                          TotalVaccination.l8 InverseStringencyIndex.l8
## TotalVaccination.d              -0.011258817               0.002142991
## InverseStringencyIndex.d         0.001806694               0.005931219

For r = 0, test statistic (19.49) is higher than the 5% significance level, presenting evidence favorable to cointegration at a lag order of 8.

CONCLUSIONS AND PRACTICAL APPLICATIONS

This paper aimed at understanding which factors have heavier influence in 2020 GDP real growth, considering this variable as a representation of COVID-19 impact. Quantitative analysis of features was performed along the study and provided valuable insights on how they relate to response variable and among them.

The use of principal component analysis helped to group correlated predictor into groups of similar contexts. Variables connected to fiscal and monetary stimulus were put together in a single predictor, which came to be significant in our final model and showed direct linear relationship to GPD real growth.

Stringency index, as well as cases and deaths, are other variables in our model directly related to COVID-19 - as opposed to demographics and diabetes prevalence, for example. For countries where the pandemic is still peaking, such as Brazil, stimulus is still an option to respond to COVID-19 impact on the economy. National or regional restrictions may also impact stringency index and number of cases and deaths.

In this scenario, a practical application of the fitted model is to estimate the effect of these factors on the next yearly growth output (2021).

As immunization programs advance, it is expected that governments will start easing social-distancing restrictions. As shown in the Israeli case, there is evidence of cointegration between total vaccinations and stringency index at a lag order of 8 - implying that a change in vaccinated population may have an impact on stringency shortly after one week.

This timeframe cannot be generalized to other countries, especially due to the diverse effectiveness of vaccine products and different approaches to social-distancing restrictions. However, it offers a hint on how vaccination drives to a better context for economic reopening.